Machine Learning on Amazon Retail Data
  • Code
  • By Bhavana
  1. Data Prep / EDA
  • Home
  • Introduction
  • Data Prep / EDA
  • Models and Methods
    • ARM (Association Rule Mining)
    • Naive Bayes
    • Clustering
    • Decision Trees
    • Neural Networks
    • Regression
    • SVM (Support Vector Machine)
  • Conclusions

On this page

  • Data Collection
  • Data Cleaning
  • Data Preprocessing / Visualization
    • Investigation of products in both sets of data

Data Prep / EDA

Where the data source, processing, and visualization (EDA) is presented.

Data Collection

Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.

import requests

payload = {
   'api_key': 'API_KEY',
   'query': 'iphone 15 charger',
   's': 'price-asc-rank'
}

response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                        params=payload).json()

The jupyter notebook code for the web scraping can be found here.

Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).

The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.

Table 1: The raw data from both datasets.
(a) The raw data scraped from Amazon using ScraperAPI
type position asin name image has_prime is_best_seller is_amazon_choice is_limited_deal stars total_reviews url availability_quantity spec price_string price_symbol price original_price section_name
0 search_product 26 B006IYTFIU DCMen's Court Graffik Casual Low Top Skate Sho... https://m.media-amazon.com/images/I/715fiNSHyo... True False False False 4.7 19276.0 https://www.amazon.com/DC-Court-Graffik-Skate-... NaN {} $62.99 $ 62.99 NaN NaN
1 search_product 19 B0CS9N2XMY Mousepad For G502X Wireless Mouse Glides Mouse... https://m.media-amazon.com/images/I/61ve+2Y6HZ... False False False False NaN NaN https://www.amazon.com/Mousepad-Wireless-Glide... NaN {} $3.37 $ 3.37 NaN NaN
2 search_product 13 B0CVBMWK1Q USB C Hub, 8 in 1 Multi Port Adapter, USB 3.0 ... https://m.media-amazon.com/images/I/61dT7o5PWU... False False False False NaN NaN https://www.amazon.com/dp/B0CVBMWK1Q/ref=sr_1_... NaN {} $13.69 $ 13.69 NaN NaN
3 search_product 22 B01JQSP498 Amazon.com Gift Card in a Hello Baby Reveal https://m.media-amazon.com/images/I/81IlADjS7a... True False False False 4.9 4030.0 https://www.amazon.com/Amazon-com-Gift-Card-He... NaN {} $15.00 $ 15.00 NaN NaN
4 search_product 14 B0C97ZTZ7K Cuisinart 14 Cup Food Processor & Coffee Maker... https://m.media-amazon.com/images/I/51TxHJv26X... False False False False 4.6 19012.0 https://www.amazon.com/Cuisinart-Processor-Res... NaN {} $364.62 $ 364.62 {'price_string': '$399.90', 'price_symbol': '$... NaN
(b) The raw data gotten from Kaggle
asin title imgUrl productURL stars reviews price listPrice category_id isBestSeller boughtInLastMonth
0 B0B1D76NZ8 Clear Bag Stadium Approved PVC Clear Purse Cle... https://m.media-amazon.com/images/I/61-Ekp8ZzR... https://www.amazon.com/dp/B0B1D76NZ8 3.2 0 12.99 0.0 9 False 0
1 B0B64WDBMP Pirate Costume Boys Pirate Pretend Set Hallowe... https://m.media-amazon.com/images/I/71vPgX9XiQ... https://www.amazon.com/dp/B0B64WDBMP 4.1 0 18.99 0.0 225 False 50
2 B0BTK5TMRZ Brother TD-4520DN Entry Level Direct Thermal D... https://m.media-amazon.com/images/I/51ahcrj74s... https://www.amazon.com/dp/B0BTK5TMRZ 0.0 0 423.99 0.0 81 False 0
3 B08WZXG9KD Water Flosser Non-Electric Portable Affordable... https://m.media-amazon.com/images/I/61Hc+A-l88... https://www.amazon.com/dp/B08WZXG9KD 4.0 0 9.99 0.0 45 False 100
4 B0CFQXBCC6 Xohny Cat Ball Launcher Toy, 50 Pcs Colorful a... https://m.media-amazon.com/images/I/61tmocfqMu... https://www.amazon.com/dp/B0CFQXBCC6 0.0 0 17.99 0.0 179 False 0

Data Cleaning

The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.

The steps to clean the web-scaped data were:

  • Add date_scraped column
  • Remove unecessary columns: type, position, has_prime, is_amazon_choice, is_limited_deal, availability_quantity, spec, price_string, price_symbol, section_name
  • Expand and fix original_price
  • Rename columns to match standard snake case for merging both datasets
  • Drop rows with no asin or name or price
  • Drop rows with price of 0.0, since that doesn’t make sense
  • Fill NaN reviews column with 0

The steps to clean the Kaggle data were:

  • Add date_scraped column
  • Remove unecessary columns boughtInLastMonth
  • Drop rows with any NaNs
  • Fix list_price of $0 to be instead equal to price
  • Change category_id to actual category by using category table
  • Drop rows with price of $0, since that doesn’t make sense
  • Rename columns to match standard snake case for merging both datasets

And then, after they were concatenated, the steps to clean were:

  • Remove duplicates (by asin + date scraped)
  • Rename columns

The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):

Table 2: The final unioned, cleaned, and processed data.
Asin Name Image Url Is Best Seller Stars Reviews Url Price Date Scraped List Price Category
0 B07D8SC8CH Little Girls Summer Outfits Floral Tank Top an... https://m.media-amazon.com/images/I/71H3ZfNqTy... False 4.5 3530.0 https://www.amazon.com/dp/B07D8SC8CH 16.99 2023-11-01 16.99 Girls' Clothing
1 B09K7C43GP Winter Pompom Beanie Hats for Women Fluffy Kni... https://m.media-amazon.com/images/I/61bpGUCBxA... False 4.4 0.0 https://www.amazon.com/dp/B09K7C43GP 21.99 2023-11-01 21.99 Women's Accessories
2 B0CDK1VXN9 SAFIGLE Ostrich Marionettes String Puppets Flu... https://m.media-amazon.com/images/I/61xdQtrHJA... False 0.0 0.0 https://www.amazon.com/dp/B0CDK1VXN9 8.89 2023-11-01 8.89 Puppets & Puppet Theaters
3 B07CMYDGRL hobbysoul 12mm x 25M Fibre-Reinforced Black Ba... https://m.media-amazon.com/images/I/612woLbDNS... False 5.0 5.0 https://www.amazon.com/dp/B07CMYDGRL 14.00 2023-11-01 14.00 Industrial Adhesives, Sealants & Lubricants
4 B000BNG4VU Coty Airspun Loose Face Powder, Translucent, P... https://m.media-amazon.com/images/I/81bIKDGPRk... False 4.5 112738.0 https://www.amazon.com/dp/B000BNG4VU 12.07 2023-11-01 12.83 Makeup

The code for the data cleaning can be found here.

Data Preprocessing / Visualization

Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.

Important

If the interactive figures don’t load, dont worry: just turn off all ad-blockers/privacy browsing, make sure you are using Chrome/Firefox, and refresh the page until all figures load.

Figure 1

Figure 1: A histogram of all categories of all Amazon products. Note scraped data did not have categories, but the Kaggle data did.

Figure 2

Figure 2: Stars vs number of reviews recieved by an amazon product, colored by whether the product was a best-seller.

Figure 3

Wordcloud for categories of products

Wordcloud for the names of products
Figure 3: Wordclouds (where more frequent appearing words are bigger) of the categories of products and the names of products.

Investigation of products in both sets of data

The following plots focused on products that had the same Asin (identifying ID), that were in both the data personally scraped in 2024 and the Kaggle data from 2023. Investigating this subset of data could reveal a lot about how product prices, reviews, and other factors changed over time, as the only way to get time difference data was to use products that actually had more than one timepoint.

Figure 4

Figure 4: Price vs list price of items with the same ASIN across dates scraped, with a trendline.

Given that we can see outliers in price affecting the plot of the graph, it was decided for analysis to only consider those prices most populous, aka prices less than $800.

Figure 5

Figure 5: Histogram of prices, colored by whether the change in price increased or decreased over time, for those items that were in both sets of data.

Figure 6

Figure 6: Price vs the difference in price, over the two sets of data, colored by whether the price diff increased or decreased.

Only categories with more than 20 products were included, to reduce noise and to focus on meaningful categories. Figure 7

Figure 7: For the included categories, a bar chat showing the ratio of products that had their price increase over time vs decrease.
Introduction
ARM (Association Rule Mining)